Today’s topic is related to Exploratory Data Analysis (EDA), which is usually the step that comes before the statistical analysis.
We’ll be using ggplot2, a widely used R package for plotting data. Other packages are also quite useful: graphics, rCharts, lattice etc. These packages do similar things, but some offer more tools. ggplot2 has a huge library of plots. Crucially, it has a very nice documentation (also, because it’s used a lot, you will find # many forums/websites about it).
This time we’ll use the english data set, from # the languageR package. There are continuous and categorical # variables in the data, and that’s useful for plotting.
NB: Last time we also had some EDA, when we used xtabs etc.
Load the packages and the english dataset, which is included in the languageR package:
library(ggplot2)
library(languageR)
library(plyr)
library(scales) # for % in plots
data(english)Let’s use a very short name for our data this time
d = englishThis dataset has a lot of columns, so let’s first create a more concise version (we’ll overwrite d).
names(d)## [1] "RTlexdec" "RTnaming"
## [3] "Familiarity" "Word"
## [5] "AgeSubject" "WordCategory"
## [7] "WrittenFrequency" "WrittenSpokenFrequencyRatio"
## [9] "FamilySize" "DerivationalEntropy"
## [11] "InflectionalEntropy" "NumberSimplexSynsets"
## [13] "NumberComplexSynsets" "LengthInLetters"
## [15] "Ncount" "MeanBigramFrequency"
## [17] "FrequencyInitialDiphone" "ConspelV"
## [19] "ConspelN" "ConphonV"
## [21] "ConphonN" "ConfriendsV"
## [23] "ConfriendsN" "ConffV"
## [25] "ConffN" "ConfbV"
## [27] "ConfbN" "NounFrequency"
## [29] "VerbFrequency" "CV"
## [31] "Obstruent" "Frication"
## [33] "Voice" "FrequencyInitialDiphoneWord"
## [35] "FrequencyInitialDiphoneSyllable" "CorrectLexdec"
columns = c("RTlexdec", "Familiarity", "Word", "AgeSubject", "LengthInLetters", "CV", "Obstruent", "Voice")
d = d[,columns]
head(d)## RTlexdec Familiarity Word AgeSubject LengthInLetters CV Obstruent
## 1 6.543754 2.37 doe young 3 C obst
## 2 6.397596 4.43 whore young 5 C obst
## 3 6.304942 5.60 stress young 6 C obst
## 4 6.424221 3.87 pork young 4 C obst
## 5 6.450597 3.93 plug young 4 C obst
## 6 6.531970 3.27 prop young 4 C obst
## Voice
## 1 voiced
## 2 voiceless
## 3 voiceless
## 4 voiceless
## 5 voiceless
## 6 voiceless
This looks good for plotting.
RTlexdec is a numeric vector of the log RT in visual lexical decision. This will be our variable of interest (response), so we’ll be plotting other variables as possible predictors of RTlexdec.
Review from last time (also related to EDA).
In what follows, create a code that explores each question. You’ll need xtabs(), prop.table() and ddply().
1. What’s the proportion of voiced and voiceless segments (initial phoneme)?
prop.table(xtabs(~Voice, d))## Voice
## voiced voiceless
## 0.4509632 0.5490368
2. Create a contingency table that includes Voice and CV. % by Voice (ignore the obvious correlation here)
prop.table(xtabs(~Voice + CV, d),1)## CV
## Voice C V
## voiced 0.9407767 0.0592233
## voiceless 1.0000000 0.0000000
3. Create a data frame that shows the mean RT and the mean Familiarity by AgeSubject and LengthInLetters. What impact do these two variables have on the mean reaction time? How about Familiarity and LengthInLetters?
ddply(d,.(AgeSubject, LengthInLetters), summarise, meanFam=mean(Familiarity), meanRT=mean(RTlexdec))## AgeSubject LengthInLetters meanFam meanRT
## 1 old 2 6.436667 6.609008
## 2 old 3 3.922692 6.650408
## 3 old 4 3.827178 6.659876
## 4 old 5 3.736024 6.658479
## 5 old 6 3.581834 6.696432
## 6 old 7 3.638333 6.717936
## 7 young 2 6.436667 6.299283
## 8 young 3 3.922692 6.423638
## 9 young 4 3.827178 6.440874
## 10 young 5 3.736024 6.439958
## 11 young 6 3.581834 6.458087
## 12 young 7 3.638333 6.465111
Documentation available at http://docs.ggplot2.org/
There are two main ways to use ggplot. The quick method (less intuitive), and the longer method (easier to remember).
qplot() is the ‘quick version’Read about it here: http://docs.ggplot2.org/0.9.3/qplot.html
This is an INCREMENTAL example: qplot(variable1, variable2, data)
Let’s plot RTlexdec as a function of Familiarity (and other vars)
qplot(Familiarity, RTlexdec, data=d)qplot(Familiarity, RTlexdec, data=d, colour=AgeSubject)qplot(Familiarity, RTlexdec, data=d, colour=AgeSubject, size=LengthInLetters)qplot(Familiarity, RTlexdec, data=d, colour=AgeSubject, size=LengthInLetters, alpha=I(0.1))qplot(Familiarity, RTlexdec, data=d, colour=AgeSubject, size=LengthInLetters, alpha=I(0.2), facets = Voice ~ Obstruent)ggplot() is the longer version (this is what we’ll be using here)Same incremental example: ggplot(data, aes(axes)) + ...
ggplot(data=d, aes(x=Familiarity, y=RTlexdec)) + geom_point()ggplot(data=d, aes(x=Familiarity, y=RTlexdec, color=AgeSubject)) + geom_point()ggplot(data=d, aes(x=Familiarity, y=RTlexdec, color=AgeSubject, size=LengthInLetters)) + geom_point()ggplot(data=d, aes(x=Familiarity, y=RTlexdec, color=AgeSubject, size=LengthInLetters)) + geom_point(alpha=0.2)ggplot(data=d, aes(x=Familiarity, y=RTlexdec, color=AgeSubject, size=LengthInLetters)) + geom_point(alpha=0.2) + facet_grid(Voice ~ Obstruent)The titles in each facet will be inherited from the factor levels. So you can change them if you wish (but this will required changing the level names).
Alpha goes from 0 to 1.
You can see that creating a plot with ggplot() is like adding components/layers.
First, you tell R the basics (data and axes). Then, you choose the method. For example, scatter plots = geom_point()
Each TYPE of plot you wish to create is generated by a geom_...():
+ geom_boxplot()+ geom_histogram()+ geom_bar()+ geom_point()+ geom_jitter()+ geom_line()+ geom_violin()+ geom_density()+ geom_text()etc.
annotation: + annotate()
pie charts: + coord_polar()
Check http://docs.ggplot2.org/ for a full list.
In fact, you can add layers that don’t necessarily make sense, but this shows you the amount of control that you have:
ggplot(data=d, (aes(x=AgeSubject, y=RTlexdec))) + geom_jitter(color='brown', alpha=0.3) + geom_boxplot() + geom_violin(alpha=0.3, fill='blue')This plot is just totally off, but it’s generated without warnings or errors. Layering is a great aspect of plotting in R, but R won’t tell you which plots make more/less sense.
Also, note that layers are built in the order you add them.
ggplot(data=d, aes(x=RTlexdec)) + geom_density(fill="blue", color="blue", alpha=0.5) + geom_rug() # You can flip your plot by adding + coord_flip():
ggplot(data=d, aes(x=RTlexdec)) + geom_density(fill="blue", color="blue", alpha=0.5) + geom_rug() + coord_flip()Whenever you want to know which arguments a geom_... takes:
?geom_...This plots the density of RTlexdec and also adds a geom_rug(), so you can see where the data points are more concentrated.
Let’s say you have a categorical variable in the x axis and a continuous variable in the y axis.
ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot()What if you want to add the data points on top of the boxplot?
ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot() + geom_jitter()This doesn’t look that great, though. Add transparency (alpha):
ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot() + geom_jitter(alpha=0.1)Adding facets to check for interactions
ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot() + geom_jitter(alpha=0.1) + facet_grid(~CV)geom_bar() and geom_histogram() can give you the same output:
ggplot(data=d, aes(x=Obstruent, y=(..count..)/sum(..count..))) + geom_bar() + scale_y_continuous(labels = percent_format()) + labs(x="Type of segment", y="%")ggplot(data=d, aes(x=Obstruent, y=(..count..)/sum(..count..))) + geom_histogram() + scale_y_continuous(labels = percent_format()) + labs(x="Type of segment", y="%")Adding ERROR BARS (standard error): a manual way to do it:
You probably won’t add error bars like this, so this is just to show you how you could in principle do it.
First, we need to create a function that calculates std errors:
sem = function(x){
n = sum(x,na.rm=T)/mean(x,na.rm=T)
sqrt(var(x, na.rm=T)/n)
}Don’t worry about syntax here; we’ll look into functions next time.
Now we’ll extract the mean RTlexdec by age group:
data = ddply(d,.(AgeSubject), summarise, meanRT = mean(RTlexdec))Finally, we’ll add the std errors to our new data frame.
data$SE = NA
data[1,'SE'] = sem(d[d$AgeSubject == 'old',]$RTlexdec)
data[2,'SE'] = sem(d[d$AgeSubject == 'young',]$RTlexdec)This is what our input looks like now:
data## AgeSubject meanRT SE
## 1 old 6.660958 0.002418593
## 2 young 6.439237 0.002224914
ggplot(data=data, aes(x=AgeSubject, y=meanRT)) + geom_errorbar(width = 0.2, aes(ymax = meanRT + SE, ymin = meanRT - SE)) + geom_bar(stat='identity', alpha=0.4)You can see error bars are tiny here (note the y-axis scale)
You can also only use a point + error bars, which makes more sense:
ggplot(data=data, aes(x=AgeSubject, y=meanRT)) + geom_errorbar(width = 0.2, aes(ymax = meanRT + SE, ymin = meanRT - SE)) + geom_point()Read more: http://docs.ggplot2.org/0.9.3.1/geom_errorbar.html
Whenever you have you a question on ggplot2, google it. A good place: http://stackoverflow.com/questions/tagged/ggplot2
BUT
YOU DON’T NEED TO DO IT MANUALLY:
ourPlot = ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) Let’s assign the basic bit of the plot to a variable to be concise
ourPlot + stat_summary(fun.y = sd, geom = "point", position="dodge", size=4, shape=5)## ymax not defined: adjusting position using y instead
ourPlot + stat_summary(fun.y = mean, geom = "point", position="dodge", size=4, shape=5)## ymax not defined: adjusting position using y instead
To get a shape chart, visit: http://www.cookbook-r.com/Graphs/Shapes_and_line_types/
You can also simply use a letter (only ONE), but use "":
ourPlot + stat_summary(fun.y = mean, geom = "point", position="dodge", size=10, shape="A")## ymax not defined: adjusting position using y instead
ourPlot + stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width=0.1)ourPlot + stat_summary(fun.data = mean_cl_normal, geom = "pointrange", width=0.1) You can add the mult argument to stat_summary, which controls the No. of sd from the mean you want to plot.
ourPlot + stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width=0.1, mult=5)How would you create the same plot as a function of the CV variable?
ourPlot + stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width=0.1, mult=5) + facet_grid(~CV)As you can see, stat_summary() is incredibly useful.
Note: Many of the examples that follow wouldn’t normally be used.
Let’s say you want to plot the actual words (i.e., values of a var). You also want to add a green (!) trend line.
ggplot(data=d[d$LengthInLetters==3,], aes(x=Familiarity, y=RTlexdec)) + geom_text(aes(label=Word), alpha=0.5) + geom_smooth(color="green", size=2)## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
geom_smooth() also includes the standard error (gray area). You can remove it, though.
ggplot(data=d[d$LengthInLetters==3,], aes(x=Familiarity, y=RTlexdec)) + geom_text(aes(label=Word), alpha=0.5) + geom_smooth(color="green", size=2, se=F)## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
You can also use other statistical methods to define the trend line. For example, let’s say you want to use a (linear) regression line:
ggplot(data=d[d$LengthInLetters==3,], aes(x=Familiarity, y=RTlexdec)) + geom_text(aes(label=Word), alpha=0.5) + geom_smooth(color="green", size=2, method='lm')What if you want to write on your plot?
ggplot(data=d[d$LengthInLetters==3,], aes(x=Familiarity, y=RTlexdec)) + geom_text(aes(label=Word), alpha=0.5) + geom_smooth(color="green", size=2) + annotate("text", x=6, y=7, label="Test", fontface="bold", size=9, color="red")## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
So you can provide x and y coordinates, fontsize, color etc.
Visit http://docs.ggplot2.org/0.9.2.1/theme.html
ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot() + geom_jitter(alpha=0.1) + ggtitle("My plot") + ylab("log(RT)") + xlab("Subject age")Visit http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/
The package {RColorBrewer} is quite impressive, so if you’re into colors, check it out.
ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot() + geom_jitter(alpha=0.1, color="blue") + ggtitle("My plot") + ylab("log(RT)") + xlab("Subject age")ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot() + geom_jitter(alpha=0.2, aes(color=Voice)) + ggtitle("My plot") + ylab("log(RT)") + xlab("Subject age")You can always change the colors, of course (you can even create your own)
Everything about ggplot can be formatted. You might want to have a white background (better for publication):
ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot() + geom_jitter(alpha=0.1) + ggtitle("My plot") + ylab("log(RT)") + xlab("Subject age") + theme_bw()You can play around with the arguments in geom_boxplot().
ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot(fill="lightblue") + geom_jitter(alpha=0.1) + ggtitle("My plot") + ylab("log(RT)") + xlab("Subject age") + theme_bw()Colors are usually avoided, but sometimes they can help a lot.
IMPORTANT:
geom_boxplot(aes(THESE ARGUMENTS)) refer to the data
geom_boxplot(aes(), THESE ARGUMENTS) refer to independent parameters
For example, if you want the plot to be red, use (2). If you want colors to represent the age of the subjects, use (1).
Again, there are several ways to change the distances in your plot. The easiest (simplest) way to add more dist. between labels and the plot is by adding \n to your strings.
ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot(width=0.4, size=1) + geom_jitter(alpha=0.1) + ggtitle("My plot\n") + ylab("log(RT)\n") + xlab("Subject age\n") + theme_bw()You can also use \n to break lines in labels
ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot() + geom_jitter(alpha=0.1) + ggtitle("My title\nhas two lines\n") + ylab("log(RT)\n") + xlab("Subject age\n") + theme_bw()If you google ‘spacing in ggplot’, you’ll find hundreds of pages.
If you need something more complete, check the theme() function:
If you don’t like R fonts, you’ll need to install a package and then load it. Once you’ve done that, you can use different font faces INSIDE theme().
library(extrafont)## Registering fonts with R
You don’t have to do that to change font size, though.
IMPORTANT: Remember that the font sizes interact with the plot window resolution when you save your plot.
ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot() + geom_jitter(alpha=0.1) + ggtitle("My title\nhas two lines\n") + ylab("log(RT)\n") + xlab("Subject age\n") + theme_bw() + theme(text=element_text(size=20, family="CMU Sans Serif"))